Abstract

Constructing an optimal rooted phylogenetic network from a set of rooted triplets is NP-hard.
In this paper, we present a heuristic algorithm called TripNet, which tries to
construct an optimal rooted phylogenetic network from an arbitrary set of triplets. We prove some
theorems to justify the performance of the algorithm.

Index Terms 

Rooted phylogenetic network, Rooted triplet, Quartet, Directed acyclic graph, Height function. 



I. Introduction 

Phylogenetic networks are a generalization of phylogenetic trees that permit the representation 
of non-tree-like underlying histories. A rooted phylogenetic network is a rooted directed acyclic 
graph in which no node has indegree greater than 2 and the outdegree of each node with indegree
2 is 1. Such nodes are called reticulation nodes. Mathematicians are interested in developing
methods that infer a phylogenetic tree or network from basic building blocks. In the computation
of a rooted tree or network, one group of basic building blocks is the triplets, the rooted binary
trees on three taxa (1). In 1981, Aho et al. studied the problem of constructing a tree from a
set of triplets (2). They proposed the BUILD algorithm, which,
given a set of triplets, either constructs in polynomial time a rooted tree that contains all the
input triplets or decides that no such tree exists. When there is no tree for a
given set of triplets one may try to produce an optimal phylogenetic network. In this context, the 
goal is to compute an optimal rooted phylogenetic network that contains all the rooted triplets. 
One possible optimality criterion is to minimize the level of the network, which is defined as the 
maximum number of reticulation nodes contained in any biconnected component of the network. 
The other optimality criterion is to minimize the number of reticulation nodes (1). In (3) and
(4) the authors considered the problem of deciding whether, given a set of triplets as input,
it is possible to construct a level-1 phylogenetic network that contains all the input triplets. They
showed that, in general, this problem is NP-hard. However, in (4) the authors showed that when
the set of triplets is dense, which means that for each set of three taxa there is at least one triplet
in the input set, the problem can be solved in polynomial time. Following these results, all research in
this new area has so far focused on constructing networks from dense triplet sets. The
algorithm by (5) can be used to find a level-1 or a level-2 phylogenetic network which minimizes
the number of reticulation nodes if such a network exists. In (6) the authors showed that, given
a dense set of triplets $\tau$ and a fixed number $k$, it is possible in time $O(|\tau|^{k+1})$ either to construct a
level-$k$ phylogenetic network consistent with $\tau$ or to decide that no such network exists.

In this paper we present a heuristic algorithm called TripNet for constructing phylogenetic 
networks from an arbitrary set of triplets. Unlike current methods, which work only for dense sets of
triplets, TripNet is also applicable to non-dense sets of triplets. The results of
the TripNet algorithm on biological sequences are presented in (7). Here we prove some theorems
to justify the performance of the algorithm. This paper is organized as follows. In Section II we
present some definitions and notation. In Section III we discuss triplet construction methods. In
Section IV the directed graph $G_\tau$ related to a set of triplets $\tau$ is introduced. In Section V the
concept of the height function of a tree is introduced, and we propose an algorithm to construct
a tree from its height function. Then we generalize the concept of the height function to
networks. Finally, in Section VI we present the TripNet algorithm.

II. Definitions and Notation

Let $X$ be a set of taxa. A rooted phylogenetic tree (tree for short) on $X$ is a rooted unordered
leaf-labeled tree whose leaves are distinctly labeled by $X$ and in which every node that is not a leaf
has outdegree at least two. A directed acyclic graph (DAG) is a directed graph that is free of
directed cycles. A directed acyclic graph $G$ is connected if there is an undirected path between
any two nodes of $G$. It is biconnected if it contains no node whose removal disconnects $G$.
A biconnected component of a graph $G$ is a maximal biconnected subgraph of $G$. A rooted
phylogenetic network (network for short) on $X$ is a rooted directed acyclic graph in which the root
has indegree 0 and outdegree 2 and every node except the root satisfies one of the following
conditions:

a) It has indegree 2 and outdegree 1. These nodes are called reticulation nodes.

b) It has indegree 1 and outdegree 2.

c) It has indegree 1 and outdegree 0. These nodes are called leaves and are distinctly labeled
by $X$.

A reticulation leaf is a leaf whose parent is a reticulation node. A network is said to be a
level-$k$ network if each of its biconnected components contains at most $k$ reticulation nodes. A
tree can be considered as a level-0 network.

A rooted triplet (triplet for short) is a binary rooted unordered tree with three leaves. We
use $ij|k$ to denote the triplet with taxa $i$ and $j$ on one side and $k$ on the other side of the root
(Fig. 1(a)). A set of triplets $\tau$ is called dense if for each subset of three taxa there is at least one
triplet in $\tau$. A triplet $ij|k$ is consistent with a network $N$, or equivalently $N$ is consistent with
$ij|k$, if $N$ contains a subdivision of $ij|k$, i.e. if $N$ contains distinct nodes $u$ and $v$ and pairwise
internally node-disjoint paths $u \to i$, $u \to j$, $v \to u$ and $v \to k$. Fig. 1(b) shows an example of
a network which is consistent with $ij|k$. A set $\tau$ of triplets is consistent with a network $N$ if all
the triplets in $\tau$ are consistent with $N$. We use the symbols $\tau(N)$ and $L_N$ to represent the set
of all triplets that are consistent with $N$ and the set of labels of its leaves respectively. For any
set $\tau$ of triplets define $L(\tau) = \bigcup_{t \in \tau} L_t$. The set $\tau$ is called a set of triplets on $X$ if $L(\tau) = X$.
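The definitions of $L(\tau)$ and of a dense triplet set can be illustrated with a minimal Python sketch. The encoding of a triplet $ij|k$ as the tuple `(i, j, k)` and the function names `leaf_set` and `is_dense` are assumptions made for this illustration only.

```python
from itertools import combinations

def leaf_set(triplets):
    """L(tau): the union of the leaf sets of all triplets (i, j, k) = ij|k."""
    return {x for t in triplets for x in t}

def is_dense(triplets):
    """True if every 3-subset of L(tau) carries at least one triplet."""
    covered = {frozenset(t) for t in triplets}
    leaves = leaf_set(triplets)
    return all(frozenset(s) in covered for s in combinations(leaves, 3))

# Illustrative input: kl|j, kl|i, jk|i, jl|i on the taxa i, j, k, l.
tau = [("k", "l", "j"), ("k", "l", "i"), ("j", "k", "i"), ("j", "l", "i")]
print(is_dense(tau))        # all four 3-subsets are covered
print(is_dense(tau[:3]))    # the subset {i, j, l} is no longer covered
```
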

III. Triplets construction method 

There are two main tree construction methods: character-based methods and distance-based
methods. In character-based methods, the information of a set $X$ of biological sequences is
directly used for producing the final tree. In distance-based methods, first a distance matrix $D$ is
computed from $X$ and then a rooted (or unrooted) tree $T$ is constructed from $D$ (1).

Fig. 1. (a) Triplet $ij|k$. (b) Triplet $ij|k$ is consistent with the network $N$. (c) The steps of removing edges with maximum
weight from a network. (d) Quartet $ij|ko$ with its inner edge. (e) A counterexample for the reverse of Theorem 5.

A weighted tree $(T, w)$ is a rooted (or unrooted) tree $T$ together with a function $w : E(T) \to \mathbb{R}$.
We call $w(e)$ the weight of the edge $e$. For any two nodes $i$ and $j$ of $T$, let $l_{ij}$ denote the unique
path in $T$ from $i$ to $j$. Define

$$d_T(i,j) = \sum_{e \in l_{ij}} w(e).$$

If $T$ is an unweighted tree then we suppose that $w(e) = 1$ for each edge $e$ in $T$. Given a set
of taxa $X$, let $(T, w)$ be a weighted tree on $X$ and $D_T$ be the matrix in which the entry of row $i$
and column $j$ is $d_T(i,j)$. We call $D_T$ the distance matrix related to $(T,w)$.

A quartet is a binary unrooted tree with four leaves. We use the symbol $ij|kl$ for a quartet
on the set of taxa $\{i,j,k,l\}$ in which the neighbor pairs are $i,j$ and $k,l$. In a quartet $Q$ there is a
unique edge whose two end points are not leaves. We call this edge the inner edge of $Q$
(Fig. 1(c)). A weighted quartet is called informative if the weight of its inner edge is positive.
The following proposition holds for informative quartets.

Proposition 1. Let $X = \{i,j,k,l\}$ be a set of four taxa and $D$ be a distance matrix on $X$. For
an informative quartet $ij|kl$, the relation $d(i,j) + d(k,l) < d(i,k) + d(j,l) = d(j,k) + d(i,l)$
holds.

The following proposition also holds (9).

Proposition 2. Let $X$ be a set of four taxa and $D$ be a distance matrix on $X$. There is a unique
quartet $Q$ on $X$ for which $D_Q = D$.

Suppose that $X$ is a set of taxa in which each taxon is a biological sequence. Let $o_X$ be an
outgroup for $X$ and $D$ be a distance matrix on $X \cup \{o_X\}$. In this paper, to obtain a set of triplets,
we use the method introduced in (5). For each set of four taxa that contains $o_X$, we construct the
unique quartet which satisfies Proposition 2. Then, by removing $o_X$ from the informative quartets,
we obtain the set of triplets. In the rest of the paper we refer to this method as QOT.
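As a rough illustration of the QOT idea, the following Python sketch derives triplets from a distance matrix and an outgroup. It is not the implementation of (5). It assumes that a quartet is treated as informative exactly when one of the three pairings has a strictly smallest sum, which is the inequality of Proposition 1; the function name `qot_triplets` and the dictionary encoding of the distance matrix are hypothetical.

```python
from itertools import combinations

def qot_triplets(taxa, outgroup, d):
    """Return triplets (a, b, c) = ab|c read off from quartets ab|c,outgroup.

    d maps frozenset({x, y}) to the distance between x and y.
    Assumption: a quartet is informative when one pairing has a strictly
    smallest sum of distances (the inequality of Proposition 1).
    """
    def dist(x, y):
        return d[frozenset((x, y))]

    triplets = []
    for i, j, k in combinations(sorted(taxa), 3):
        sums = {
            (i, j, k): dist(i, j) + dist(k, outgroup),
            (i, k, j): dist(i, k) + dist(j, outgroup),
            (j, k, i): dist(j, k) + dist(i, outgroup),
        }
        best = min(sums, key=sums.get)
        if all(sums[best] < v for t, v in sums.items() if t != best):
            triplets.append(best)       # the outgroup is simply dropped
    return triplets

# Tree metric in which a and b are neighbours; o is the outgroup.
pairs = {"ab": 2, "ac": 4, "bc": 4, "ao": 5, "bo": 5, "co": 5}
d = {frozenset(p): v for p, v in pairs.items()}
print(qot_triplets(["a", "b", "c"], "o", d))
```

When the three sums tie, the sketch returns no triplet for that set of taxa, so the resulting triplet set need not be dense.
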

IV. The directed graph related to a set of triplets 

Let $\tau$ be a set of triplets. Define $G_\tau$, the directed graph related to $\tau$, by $V(G_\tau) = \{\{i,j\} : i,j \in
L(\tau), i \neq j\}$ (we denote $\{i,j\}$ by $ij$ for short) and $E(G_\tau) = \{(ij,ik) : ij|k \in \tau\} \cup \{(ij,jk) :
ij|k \in \tau\}$. The graph $G_\tau$ plays an important role in the remainder of the paper, and in this section
we prove some basic properties of $G_\tau$.
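The construction of $G_\tau$ can be sketched in a few lines of Python. The names `build_graph` and `is_dag` and the encoding of $ij|k$ as `(i, j, k)` are assumptions of this sketch; the acyclicity test is a standard indegree-peeling (Kahn-style) check rather than anything specific to this paper.

```python
from itertools import combinations

def build_graph(triplets):
    """Return (nodes, edges) of G_tau; a node is the frozenset {i, j}."""
    leaves = {x for t in triplets for x in t}
    nodes = {frozenset(p) for p in combinations(leaves, 2)}
    edges = set()
    for i, j, k in triplets:                   # (i, j, k) encodes ij|k
        ij = frozenset((i, j))
        edges.add((ij, frozenset((i, k))))     # edge (ij, ik)
        edges.add((ij, frozenset((j, k))))     # edge (ij, jk)
    return nodes, edges

def is_dag(nodes, edges):
    """Repeatedly delete nodes of indegree zero; a DAG is emptied this way."""
    indeg = {v: 0 for v in nodes}
    out = {v: [] for v in nodes}
    for u, v in edges:
        indeg[v] += 1
        out[u].append(v)
    ready = [v for v in nodes if indeg[v] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for v in out[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return removed == len(nodes)

tau = [("k", "l", "j"), ("k", "l", "i"), ("j", "k", "i"), ("j", "l", "i")]
print(is_dag(*build_graph(tau)))
```

Two conflicting triplets on the same taxa, such as $ij|k$ and $ik|j$, produce the edges $(ij,ik)$ and $(ik,ij)$ and therefore a directed cycle.
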

Let $X$ be a set of sequences, $D = [d(i,j)]$ be a distance matrix on $X$ where for any pair $i,j \in
X$, $d(i,j)$ denotes the distance between them, and $\tau$ be the set of triplets that is produced by the QOT
method. Here we define the concept of the closure of $\tau$. If $ij|k$ and $js|i$ are in $\tau$, then we have the
quartets $ij|ko_X$ and $js|io_X$. According to Proposition 1, $d(i,j) + d(k,o_X) < d(j,k) + d(i,o_X)$ and
$d(j,s) + d(i,o_X) < d(i,j) + d(s,o_X)$, and therefore $d(j,s) + d(k,o_X) < d(j,k) + d(s,o_X)$. It
means that we should have the quartet $js|ko_X$. The equivalent triplet for this informative quartet
is $js|k$. If this triplet is not in $\tau$, add it to $\tau$, and continue this procedure until one cannot add
more triplets. We use the symbol $\bar{\tau}$ to show this new set of triplets and call it the closure of $\tau$.
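The closure procedure is a fixed-point iteration, which the following Python sketch makes explicit. The function name `closure` and the encoding of a triplet $ab|c$ as `(frozenset({a, b}), c)` are assumptions; the sketch also assumes $s \neq k$, since otherwise $js|k$ would not be a triplet on three distinct taxa.

```python
def closure(triplets):
    """Close a triplet set under the rule: ij|k and js|i imply js|k.

    Input triplets are tuples (a, b, c) = ab|c; the result stores each
    triplet as (frozenset({a, b}), c).
    """
    closed = {(frozenset((a, b)), c) for a, b, c in triplets}
    changed = True
    while changed:
        changed = False
        for cherry1, k in list(closed):            # candidate ij|k
            for cherry2, i in list(closed):        # candidate js|i
                if i not in cherry1:
                    continue
                (j,) = cherry1 - {i}
                if j not in cherry2:
                    continue
                (s,) = cherry2 - {j}
                new = (frozenset((j, s)), k)
                if s != k and new not in closed:   # assumption: s differs from k
                    closed.add(new)
                    changed = True
    return closed

result = closure([("i", "j", "k"), ("j", "s", "i")])
print((frozenset(("j", "s")), "k") in result)
```
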

The following lemma is an immediate consequence of the definition of the closure $\bar{\tau}$.

Lemma 1. Let $X$ be a set of sequences and $\tau$ be the set of triplets which is produced by the
QOT method. Then the closure $\bar{\tau}$ contains at most one triplet for each $\{i,j,k\} \subseteq X$.

Now we state the main results of this section.

Theorem 1. Let $X$ be a set of sequences and $\tau$ be the set of triplets which is produced by the
QOT method. Then $G_\tau$ is a DAG.

Proof: We prove a stronger result and show that $G_{\bar{\tau}}$ is a DAG, where $\bar{\tau}$ is the closure of $\tau$. The proof proceeds by
induction on the length of the shortest cycle in $G_{\bar{\tau}}$. First we prove that $G_{\bar{\tau}}$ does not contain any
cycle of length 3. Assume that $C$ is a cycle of length 3 in $G_{\bar{\tau}}$. Let $(ij,ik)$ be an edge of $C$. The
triplet which corresponds to this edge is $ij|k$. Suppose that the third node of the cycle is $st$. Thus
the other edges of the cycle are $(st,ij)$ and $(ik,st)$. So $|\{i,j\} \cap \{s,t\}| = |\{i,k\} \cap \{s,t\}| = 1$.
There are two cases. Case 1: $s = i$ and $t \neq j,k$ (or $t = i$ and $s \neq j,k$). Case 2: $s = j$ and $t = k$
(or $s = k$ and $t = j$). For the first case, the edges of $C$ are $(ij,ik)$, $(ik,it)$ and $(it,ij)$. The three
quartets which correspond to the triplets of these three edges are $ij|ko_X$, $ik|to_X$ and $it|jo_X$.
According to Proposition 1, we have the three inequalities $d(i,j) + d(k,o_X) < d(i,k) + d(j,o_X)$,
$d(i,k) + d(t,o_X) < d(i,t) + d(k,o_X)$ and $d(i,t) + d(j,o_X) < d(i,j) + d(t,o_X)$. By summing
up these inequalities, we obtain a contradiction. For the second case the edges of $C$ are $(ij,ik)$,
$(ik,jk)$ and $(jk,ij)$. The three triplets corresponding to these three edges are $ij|k$, $ik|j$ and $jk|i$,
which contradicts Lemma 1. So there is no cycle of length 3 in $G_{\bar{\tau}}$. Now assume that there is
no cycle of length $k \geq 3$ in $G_{\bar{\tau}}$ and let $C$ be a cycle of length $k+1$ in it. First we claim that there
is no path $s_1s_2 \to s_3s_4 \to s_5s_6$ in $C$ such that $|\{s_1,s_2\} \cap \{s_3,s_4\} \cap \{s_5,s_6\}| = 1$. Suppose that
there exists such a path. So this path is of the form $js \to ji \to jk$ and the triplets $js|i$ and $ij|k$ are
in $\bar{\tau}$. The method of constructing $\bar{\tau}$ implies that $js|k$ is in $\bar{\tau}$ and the edge $js \to jk$ is in $G_{\bar{\tau}}$. So
we obtain a cycle of length $k$ in $G_{\bar{\tau}}$, a contradiction.

Let $s_1s_2$ be a node of $C$. There exists a node $s_3$ such that the edge $s_1s_2 \to s_1s_3$ is in $C$ and
$s_1s_3$ is connected to a node $s_1s_4$ or $s_3s_4$. If $s_1s_4 \in V(C)$ then $s_1s_2 \to s_1s_3 \to s_1s_4$ will be in $C$,
which contradicts the above claim. So the cycle $C$ is of the form $s_1s_2 \to s_1s_3 \to s_3s_4 \to \cdots \to
s_ks_{k+1} \to s_{k+1}s_{k+2} \to s_1s_2$. For the edge $s_{k+1}s_{k+2} \to s_1s_2$ we obtain $|\{s_1,s_2\} \cap \{s_{k+1},s_{k+2}\}| =
1$. For the cases $s_{k+1} = s_l$, $l \in \{1,2\}$, or $s_{k+2} = s_1$ we have a cycle of length $k$ in $G_{\bar{\tau}}$. So
$s_{k+2} = s_2$ and the triplets $s_1s_2|s_3$, $s_1s_3|s_4$, $s_3s_4|s_5$, $\ldots$, $s_{k-1}s_k|s_{k+1}$, $s_ks_{k+1}|s_2$ and $s_{k+1}s_2|s_1$ are in
$\bar{\tau}$. Equivalently, writing $o$ for the outgroup $o_X$, we have the following inequalities:

$$d(s_1,s_2) + d(s_3,o) < d(s_1,s_3) + d(s_2,o),$$
$$d(s_1,s_3) + d(s_4,o) < d(s_3,s_4) + d(s_1,o),$$
$$\vdots$$
$$d(s_k,s_{k+1}) + d(s_2,o) < d(s_{k+1},s_2) + d(s_k,o),$$
$$d(s_{k+1},s_2) + d(s_1,o) < d(s_1,s_2) + d(s_{k+1},o).$$

Summing these inequalities, we obtain a contradiction. So there is no cycle of length $k+1$ in
$G_{\bar{\tau}}$. ■

Let $\tau$ be a set of triplets that is consistent with a tree. Let $T_\tau$ denote the unique tree that is
produced by the BUILD algorithm.

Theorem 2. Let $\tau$ be a set of triplets that is consistent with a tree. Then $G_\tau$ is a DAG.

Proof: The proof proceeds by induction on $|L(\tau)|$. It is trivial when $|L(\tau)| = 3$. Assume
that the theorem holds when $|L(\tau)| \leq k$. Let $|L(\tau)| = k+1$ and $T_1, T_2, \ldots, T_m$ be the $m$ subtrees that
are obtained from $T_\tau$ by removing its root. For each $i$, $1 \leq i \leq m$, let $\tau_i = \tau|_{L_{T_i}}$ denote the set
of all triplets in $\tau$ whose leaves are in $L_{T_i}$. By the induction assumption, for each $i$, $1 \leq i \leq m$,
$G_{\tau_i}$ is a DAG. Let $\tau' = \bigcup_{1 \leq i \leq m} \tau_i$ and $G' = \bigcup_{1 \leq i \leq m} G_{\tau_i}$. Apparently, $G'$ is a DAG and $G_{\tau'} = G'$.

The graph $G_\tau$ can be obtained from $G'$ by adding the nodes which belong to $V(G_\tau) \setminus V(G')$
and the edges corresponding to the triplets in $\tau \setminus \tau'$. If a triplet $t = ab|c \in \tau \setminus \tau'$ then there are
$1 \leq i < j \leq m$ such that $a,b \in L(\tau_i)$ and $c \in L(\tau_j)$. It means that the edges corresponding to
the triplets in $\tau \setminus \tau'$ are of the form $(ab,ac)$ such that $ac \in V(G_\tau) \setminus V(G')$. So all nodes in
$V(G_\tau) \setminus V(G')$ have outdegree zero and the edges in $G_\tau \setminus G'$ are from $V(G')$ to $V(G_\tau) \setminus V(G')$.
Now if there exists a cycle in $G_\tau$, it should contain a node in $V(G_\tau) \setminus V(G')$, which contradicts
the fact that these nodes have outdegree zero, and the proof is complete. ■

V. Height function 

In this section first the concept of the height function of a tree and a DAG is introduced and
then the BUILD algorithm is restated based on this concept.
Let $\binom{X}{2}$ denote the set of all subsets of $X$ of size 2.

Definition 1. Let $X$ be an arbitrary finite set. A function $h : \binom{X}{2} \to \mathbb{N}$ is called a height function
on $X$.

Let $T$ be a rooted tree with root $r$, let $c_{ij}$ be the lowest common ancestor of the leaves $i$ and
$j$, and let $l_T$ denote the length of the longest path starting from $r$.

Definition 2. The height function of $T$, $h_T$, is defined as $h_T(i,j) = l_T - d_T(c_{ij}, r)$, where $i$ and
$j$ are two distinct leaves of $T$.

The following theorem represents the relation between the height function of a tree and a 
triplet consistent with it. 

Theorem 3. Let $T$ be a tree. A triplet $ij|k$ is consistent with $T$ if and only if $h_T(i,j) < h_T(i,k)$
or $h_T(i,j) < h_T(j,k)$.

Proof: Let $ij|k$ be consistent with $T$. By definition $h_T(i,j) < h_T(i,k)$ and $h_T(i,j) <
h_T(j,k)$. Now suppose that for three arbitrary leaves $i$, $j$ and $k$ we have $h_T(i,j) < h_T(i,k)$
or $h_T(i,j) < h_T(j,k)$. Without loss of generality suppose that $h_T(i,j) < h_T(i,k)$. Since $c_{ij}$ and
$c_{ik}$ are on the unique path from the root $r$ to $i$ and $d_T(c_{ij},r) > d_T(c_{ik},r)$, there is a path
from the lowest common ancestor of $i,k$ to the lowest common ancestor of $i,j$, from which it follows
that $ij|k$ is consistent with $T$. ■

Let $\tau$ be a set of triplets, $G_\tau$ be a DAG and $l_{G_\tau}$ denote the length of the longest path in $G_\tau$.
Since $G_\tau$ is a DAG, the set of nodes with outdegree zero is nonempty. Assign $l_{G_\tau} + 1$ to the
nodes with outdegree zero and remove them from $G_\tau$. Assign $l_{G_\tau}$ to the nodes with outdegree
zero in the resulting graph and continue this procedure until all nodes are removed.

Definition 3. For any two distinct $i,j \in L(\tau)$, define $h_{G_\tau}(i,j)$ as the value that is assigned by
the above procedure to the node $ij$ and call it the height function related to $G_\tau$.
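The peeling procedure behind Definition 3 can be sketched as follows. The function name `height_function`, the use of plain strings such as `"ij"` as node labels, and the convention of returning `None` for a cyclic graph are assumptions of this illustration.

```python
def height_function(nodes, edges):
    """Assign heights by repeatedly peeling off the nodes of outdegree zero.

    nodes: iterable of hashable labels; edges: iterable of (u, v) pairs.
    Returns a dict node -> height, or None if the graph contains a cycle.
    """
    nodes = set(nodes)
    pred = {v: set() for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for u, v in set(edges):
        pred[v].add(u)
        outdeg[u] += 1

    layers = []                      # layers[r] = nodes removed in round r
    remaining = set(nodes)
    sinks = [v for v in nodes if outdeg[v] == 0]
    while sinks:
        layers.append(sinks)
        remaining.difference_update(sinks)
        nxt = []
        for v in sinks:
            for u in pred[v]:
                outdeg[u] -= 1
                if outdeg[u] == 0:
                    nxt.append(u)
        sinks = nxt
    if remaining:
        return None                  # some nodes lie on a directed cycle
    longest = len(layers) - 1        # number of edges on a longest path
    return {v: longest + 1 - r for r, layer in enumerate(layers) for v in layer}

# G_tau for tau = {kl|j, kl|i, jk|i, jl|i}, with node {i, j} written "ij".
nodes = ["ij", "ik", "il", "jk", "jl", "kl"]
edges = [("kl", "jk"), ("kl", "jl"), ("kl", "ik"), ("kl", "il"),
         ("jk", "ij"), ("jk", "ik"), ("jl", "ij"), ("jl", "il")]
print(height_function(nodes, edges))
```

In this example the longest path has two edges, so the first round assigns 3 to `ij`, `ik` and `il`, the second assigns 2 to `jk` and `jl`, and the last assigns 1 to `kl`.
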

Let $\tau$ be a set of triplets that is consistent with a tree. By Theorem 2, $G_\tau$ is a DAG and $h_{G_\tau}$
is well-defined. The following theorem gives a method to obtain $h_{T_\tau}$ from $\tau$ using $h_{G_\tau}$.

Theorem 4. Let $\tau$ be a set of triplets which is consistent with a tree. Then $h_{G_\tau} = h_{T_\tau}$.

Proof: The proof proceeds by induction on $|L_{T_\tau}|$. It is trivial when $|L_{T_\tau}| = 3$. Assume that
the theorem holds when $|L_{T_\tau}| \leq k$. Let $|L_{T_\tau}| = k+1$ and $T_1, T_2, \ldots, T_m$ be the $m$ subtrees which are
obtained from $T_\tau$ by removing its root. For each $i$, $1 \leq i \leq m$, let $\tau_i = \tau|_{L_{T_i}}$ and $r_i$ be the root
of $T_i$. By the induction assumption, for each $i$, $1 \leq i \leq m$, $h_{G_{\tau_i}} = h_{T_{\tau_i}}$. Moreover we conclude
from the BUILD algorithm that $T_i = T_{\tau_i}$ for $1 \leq i \leq m$. Thus $h_{G_{\tau_i}} = h_{T_i}$ for $1 \leq i \leq m$. So
for $1 \leq i \leq m$, the maximum length of the longest path in $T_i$ is $l_{T_\tau} - 1$. It means that for
$1 \leq i \leq m$, the maximum length of the longest path in $G_{\tau_i}$ is $l_{T_\tau} - 2$. Therefore, by the proof of
Theorem 2, the length of the longest path in $G_\tau$ is $l_{T_\tau} - 1$. Let $a,b \in L_{T_\tau}$. We have two cases.

Case 1: For some $i$ and $j$, $1 \leq i < j \leq m$, $a \in L_{T_i}$ and $b \in L_{T_j}$. Since the outdegree of $ab$
in $G_\tau$ is zero and $c_{ab} = r$, we have $h_{T_\tau}(a,b) = l_{T_\tau} = h_{G_\tau}(a,b)$.

Case 2: For some $i$, $1 \leq i \leq m$, $a,b \in L_{T_i}$. By the induction assumption $h_{G_{\tau_i}}(a,b) =
h_{T_{\tau_i}}(a,b)$. Therefore

$$h_{T_\tau}(a,b) = l_{T_\tau} - d_{T_\tau}(c_{ab},r) = l_{T_\tau} - (d_{T_{\tau_i}}(c_{ab},r_i) + 1)$$
$$= (l_{T_\tau} - l_{T_{\tau_i}} - 1) + (l_{T_{\tau_i}} - d_{T_{\tau_i}}(c_{ab},r_i)) = (l_{T_\tau} - l_{T_{\tau_i}} - 1) + h_{T_{\tau_i}}(a,b)$$
$$= (l_{T_\tau} - l_{T_{\tau_i}} - 1) + h_{G_{\tau_i}}(a,b) = h_{G_\tau}(a,b).$$

The last equality is obtained by the construction of $G_\tau$ from the $G_{\tau_i}$, which is stated in the proof of Theorem
2. So for each $a,b \in L_{T_\tau}$, $h_{T_\tau}(a,b) = h_{G_\tau}(a,b)$ and the proof is complete. ■

Now we describe an algorithm similar to the BUILD algorithm, using the height function. We refer
to this algorithm as HBUILD. Let $h$ be a height function on $X$. Define a weighted complete
graph $(G,h)$ where $V(G) = X$ and edge $\{i,j\}$ has weight $h(i,j)$. Remove the edges with
maximum weight from $G$. If the graph is still connected after removing these edges, the algorithm
stops. Otherwise, the process of removing the edges with maximum weight is continued in each
connected component until each connected component contains only one node. At the end of
this procedure one can reconstruct the tree by reversing the steps of the algorithm, similarly to
the BUILD algorithm (see Fig. 2). The algorithm above decides in polynomial time whether a tree
with height function $h$ exists.

Now if $\tau$ is a set of triplets which is consistent with a tree, by Theorems 2 and 4, $G_\tau$ is a
DAG, $h_{G_\tau} = h_{T_\tau} = h$, and the HBUILD algorithm constructs $T_\tau$.
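A recursive reading of HBUILD can be sketched in Python as below. This is an illustration under assumptions rather than the authors' implementation: the function name `hbuild` is hypothetical, the tree is returned as nested frozensets of leaf labels, and `None` signals that the graph stayed connected after the maximum-weight edges were removed. The sketch recovers only the shape of the tree and does not check the exact height values.

```python
from itertools import combinations

def hbuild(leaves, h):
    """Return the tree shape as nested frozensets, or None if HBUILD stops.

    h maps frozenset({i, j}) to the height of the pair {i, j}.
    """
    leaves = list(leaves)
    if len(leaves) == 1:
        return leaves[0]
    top = max(h[frozenset(p)] for p in combinations(leaves, 2))

    # Connected components after deleting the maximum-weight edges.
    component = {x: {x} for x in leaves}
    for i, j in combinations(leaves, 2):
        if h[frozenset((i, j))] < top and component[i] is not component[j]:
            merged = component[i] | component[j]
            for x in merged:
                component[x] = merged
    parts = {frozenset(c) for c in component.values()}
    if len(parts) == 1:
        return None                        # still connected: no tree
    subtrees = [hbuild(p, h) for p in parts]
    if any(t is None for t in subtrees):
        return None
    return frozenset(subtrees)

# Heights for tau = {kl|j, kl|i, jk|i, jl|i}; frozenset("ij") is {"i", "j"}.
h = {frozenset("ij"): 3, frozenset("ik"): 3, frozenset("il"): 3,
     frozenset("jk"): 2, frozenset("jl"): 2, frozenset("kl"): 1}
print(hbuild("ijkl", h))
```

For this input the sketch splits off $i$ first, then $j$, and finally separates $k$ from $l$, which corresponds to the caterpillar tree with cherry $\{k,l\}$.
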

Fig. 2. The steps of constructing $T_\tau$ from the given set $\tau = \{kl|j, kl|i, jk|i, jl|i\}$. (a) The graph $G_\tau$. (b) The graph $(G,h)$.
(c) Removing maximum weights from the graph $(G,h)$. (d) Constructing $T_\tau$ using step (c).

Now we generalize the concept of height function from trees to networks. This generalization
is not straightforward because the concept of the (lowest) common ancestor of two leaves of a
network is not well-defined. Let $N$ be a network with root $r$ and $l_N$ be the length of the
longest directed path from $r$ to the leaves. For each node $u$ consider $d(u,r)$ as the length of
the longest directed path from $r$ to $u$. For any two nodes $u$ and $v$, we call $u$ an ancestor of $v$
if there exists a directed path from $u$ to $v$. If $u$ is an ancestor of $v$ then we say that $v$ is lower
than $u$. The lowest common ancestor of two leaves in a network is not necessarily unique. For any
two leaves $i$ and $j$, let $C_{ij}$ be the set of all lowest common ancestors of $i$ and $j$.

Definition 4. For each pair of leaves $i$ and $j$, define $h_N(i,j) = \min\{l_N - d(c,r) : c \in C_{ij}\}$ and
call it the height function of $N$.

Obviously, every network $N$ determines a unique height function $h_N$. But two different networks
may have the same height function (see Fig. 3(a)).

Fig. 3. (a) Two different networks $T$ and $N$ with the same height function: $h_N = h_T = h$, where $h(j,k) = 1$, $h(i,j) = h(i,k) = 2$ and
$h(i,l) = h(j,l) = h(k,l) = 3$. (b) $T_2$ is a binarization of $T_1$.

In the following proposition we prove that for a given height function $h$ there is a network
$N$ such that $h_N = h$.

Proposition 3. Let $X$ be an arbitrary finite set and $h$ be a height function on $X$. Then there
exists a network $N$, not necessarily binary, such that $h_N = h$.

Proof: Let $h_{\max} = \max\{h(x,y) : x,y \in X\}$. Let $r$ be the root of $N$. For each pair of
nodes $x$ and $y$ with $h(x,y) = h_{\max}$, we connect $x$ and $y$ to $r$. For each pair of nodes $x$ and
$y$ with $h(x,y) < h_{\max}$, we consider a new node and connect this node to $r$ by $h_{\max} - h(x,y)$
edges. By deleting multiple edges we obtain a network $N$ with $h_N = h$. ■

The following theorem shows the relation between height function of a network and the triplet 
consistency with it. 

Theorem 5. Let $N$ be a network, and $i$, $j$ and $k$ be three distinct leaves of $N$. If $h_N(i,j) < h_N(i,k)$
or $h_N(i,j) < h_N(j,k)$ then $ij|k$ is consistent with $N$.

Proof: Suppose that $h_N(i,j) < h_N(i,k)$. Let $v_{ij}$ and $v_{ik}$ be common ancestors of $i,j$ and
$i,k$ respectively, such that $h_N(i,j) = l_N - d(v_{ij},r)$ and $h_N(i,k) = l_N - d(v_{ik},r)$. Let $l_i$ and
$l_j$ be two distinct paths from $v_{ij}$ to $i$ and $j$, respectively. Let $l_k$ be an arbitrary path from $v_{ik}$
to $k$. If $l_i \cap l_k \neq \emptyset$ then it follows that $h_N(i,j) \geq h_N(i,k)$, which is a contradiction. So $ij|k$ is
consistent with $N$. ■

The converse of the above theorem is not necessarily true. For example, consider the network
of Fig. 1(e). The triplet $ij|k$ is consistent with it, but $h(i,j) = h(i,k) = 3$ and $h(j,k) = 2$.

The basic idea of the TripNet algorithm is to find a height function as an intermediate
computational step that yields the minimum amount of information required to construct the
network from a set of triplets. So it is important to find a way of computing $h_N$ from a set of
triplets. In the rest of this section we introduce a method for computing $h_N$ using
Integer Programming. Let $\tau$ be a set of triplets with $|L(\tau)| = n$. Inspired by Theorems 3 and
5, for each triplet $ij|k \in \tau$ define the two inequalities $h(i,k) - h(i,j) \geq 1$ and $h(j,k) - h(i,j) \geq 1$.
Since the number of variables in such inequalities is at most $\binom{n}{2}$, we obtain the following
system of inequalities from $\tau$.

$$
\begin{array}{ll}
h(i,k) - h(i,j) \geq 1 & ij|k \in \tau, \\
h(j,k) - h(i,j) \geq 1 & ij|k \in \tau, \\
0 < h(i,j) \leq \binom{n}{2} & 1 \leq i,j \leq n.
\end{array}
$$

Let $s$ be an integer. Define the following Integer Programming problem and call it IP$(\tau, s)$.

$$
\begin{array}{lll}
\text{Maximize} & \sum_{1 \leq i,j \leq n} h(i,j) & \\
\text{Subject to:} & h(i,k) - h(i,j) \geq 1 & ij|k \in \tau, \\
 & h(j,k) - h(i,j) \geq 1 & ij|k \in \tau, \\
 & 0 < h(i,j) \leq s & 1 \leq i,j \leq n.
\end{array}
$$

Intuitively, if IP$(\tau, s)$ has a feasible solution, we expect that the optimal solution to this integer
program is an approximation of the height function of an optimal network $N$ consistent
with $\tau$. The following theorems support this intuition.

Theorem 6. Let $\tau$ be a set of triplets. Then $G_\tau$ is a DAG if and only if for some integer $s$,
IP$(\tau, s)$ has a feasible solution. In this case the minimum number $s$ for which IP$(\tau, s)$ has a
feasible solution is $l_{G_\tau} + 1$.

Proof: Let $G_\tau$ be a DAG. Without loss of generality assume that $G_\tau$ is connected. The
proof proceeds by induction on $l_{G_\tau}$. If $l_{G_\tau} = 1$ then obviously for $s = 1$, IP$(\tau, s)$ has no feasible
solution and for each $s \geq 2$, IP$(\tau, s)$ has a feasible solution. Assume that the theorem holds for
$l_{G_\tau} \leq k$. Suppose that $\tau$ is a set of triplets with $l_{G_\tau} = k+1$. Let $A$ be the set of the terminal
nodes of all longest paths in $G_\tau$. For each $ij \in A$ there is some $x \in L(\tau)$ such that $ix|j \in \tau$.
Let $B$ be the set of all such triplets and $\tau' = \tau \setminus B$. Apparently, $B \neq \emptyset$ and the length of the
longest path in $G_{\tau'}$ is $k$. By the induction assumption the minimum number $s$ for which IP$(\tau', s)$
has a feasible solution is $l_{G_{\tau'}} + 1 = l_{G_\tau}$. Let $h'$ be a feasible solution to IP$(\tau', l_{G_\tau})$ and consider
IP$(\tau, l_{G_\tau} + 1)$. Define $h(i,j) = l_{G_\tau} + 1$ for
each $ij \in A$ and $h(t,l) = h'(t,l)$ for each $tl \notin A$. Then $h$ is a feasible solution to IP$(\tau, l_{G_\tau} + 1)$. Now
if IP$(\tau, s)$ has a feasible solution then IP$(\tau', s-1)$ has a feasible solution. So $l_{G_\tau} + 1$ is the minimum
$s$ for which IP$(\tau, s)$ has a feasible solution. Conversely, suppose that $\tau$ is a set of triplets and for some integer $s$, IP$(\tau, s)$ has
a feasible solution $h$. Assume that $G_\tau$ has a cycle $C = i_1j_1 \to i_2j_2 \to \cdots \to i_mj_m \to i_1j_1$.

Corresponding to $C$ we have the inequalities $h(i_1,j_1) < h(i_2,j_2) < \cdots < h(i_m,j_m) < h(i_1,j_1)$, which
is a contradiction, and the proof is complete. ■
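To make the constraints of IP$(\tau, s)$ concrete, the following Python sketch checks whether a candidate height function is feasible and evaluates the objective. The function names `is_feasible` and `objective` are hypothetical, and the objective is summed once per unordered pair, which is an assumption about how the sum over $1 \leq i,j \leq n$ is meant to be read.

```python
def is_feasible(triplets, h, s):
    """Check the constraints of IP(tau, s) for a candidate height function h.

    triplets: iterable of (i, j, k) encoding ij|k.
    h: dict mapping frozenset({x, y}) to an integer height.
    """
    def height(x, y):
        return h[frozenset((x, y))]

    if any(not 0 < value <= s for value in h.values()):
        return False
    return all(
        height(i, k) - height(i, j) >= 1 and height(j, k) - height(i, j) >= 1
        for i, j, k in triplets
    )

def objective(h):
    """Sum of all pair heights (each unordered pair counted once)."""
    return sum(h.values())

tau = [("k", "l", "j"), ("k", "l", "i"), ("j", "k", "i"), ("j", "l", "i")]
h = {frozenset("ij"): 3, frozenset("ik"): 3, frozenset("il"): 3,
     frozenset("jk"): 2, frozenset("jl"): 2, frozenset("kl"): 1}
print(is_feasible(tau, h, 3), objective(h))
```

For this $\tau$ the longest path in $G_\tau$ has two edges, so Theorem 6 gives $s = 3$ as the smallest bound for which a feasible solution exists; the candidate above attains it.
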

Let $\tau$ be a set of triplets consistent with a tree. By Theorems 2, 3, and 6, $h_{T_\tau}$ is a feasible
solution to IP$(\tau, l_{G_\tau} + 1)$. In the following theorem we prove the uniqueness of this solution.

Theorem 7. Let $\tau$ be a set of triplets consistent with a tree. Then $h_{T_\tau}$ is the unique optimal
solution to IP$(\tau, l_{G_\tau} + 1)$.

Proof: By Theorem 2, $G_\tau$ is a DAG. So $l_{G_\tau}$ is well defined. The proof proceeds by induction
on $l_{G_\tau}$. Without loss of generality assume that $G_\tau$ is connected. The theorem is trivial when
$l_{G_\tau} = 1$. Assume that for each set of triplets $\tau$ consistent with a tree with $l_{G_\tau} = k \geq 1$, $h_{T_\tau}$ is the unique optimal solution to
IP$(\tau, l_{G_\tau} + 1)$. Suppose that $\tau$ is a set of triplets consistent with a tree and
$l_{G_\tau} = k+1$. Let $\tau'$ be the set of triplets which is introduced in the proof of Theorem 6. By the
induction assumption $h_{T_{\tau'}}$ is the unique optimal solution to IP$(\tau', l_{G_{\tau'}} + 1)$. By Theorem 6 the
minimum $s$ for which IP$(\tau, s)$ has a feasible solution is $l_{G_\tau} + 1$. Also $l_{G_{\tau'}} + 1 = l_{G_\tau}$. It follows
that $h_{T_\tau}$ is the unique optimal solution to IP$(\tau, l_{G_\tau} + 1)$ and the proof is complete. ■

The BUILD tree is not necessarily a binary tree. To obtain a binary tree consistent with a set
of triplets we use the following procedure.

Let $T$ be a tree and $x$ be a node of $T$ with $x_1, x_2, \ldots, x_k$, $k \geq 3$, as its children. Consider a new
node $y$. Construct $T'$ by removing the edges $(x,x_1), (x,x_2), \ldots, (x,x_{k-1})$ from $T$ and adding
the edges $(x,y), (y,x_1), (y,x_2), \ldots, (y,x_{k-1})$ to $T$. Continuing the same method for each node
with outdegree more than 2, we obtain a binary tree which we call a binarization of $T$ (see Fig.
3(b)). Obviously, we can obtain different binarizations of $T$. The proof of the following theorem
is easy and we omit it.

Theorem 8. Let $\tau$ be a set of triplets that is consistent with a tree $T_1$, and $T_2$ be a binarization
of $T_1$. Then $\tau$ is consistent with $T_2$.
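The binarization step, and the claim of Theorem 8 on a small instance, can be checked with the following Python sketch. Trees are encoded as nested tuples whose leaves are non-tuple labels; this encoding and the names `binarize`, `leaves` and `triplets_of` are assumptions of the sketch. It produces one particular binarization, obtained by always grouping the first $k-1$ children under the new node $y$.

```python
from itertools import combinations

def binarize(tree):
    """Return one binarization of a tree given as nested tuples."""
    if not isinstance(tree, tuple):
        return tree
    children = [binarize(c) for c in tree]
    if len(children) <= 2:
        return tuple(children)
    # New node y takes the first k-1 children; x keeps y and the last child.
    y = binarize(tuple(children[:-1]))
    return (y, children[-1])

def leaves(tree):
    """Set of leaf labels below a node."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaves(c) for c in tree))

def triplets_of(tree):
    """All triplets ab|c consistent with the tree, as (frozenset({a, b}), c)."""
    if not isinstance(tree, tuple):
        return set()
    found = set()
    all_leaves = leaves(tree)
    for child in tree:
        inside = leaves(child)
        found |= triplets_of(child)
        for a, b in combinations(sorted(inside), 2):
            for c in all_leaves - inside:
                found.add((frozenset((a, b)), c))
    return found

t1 = (("a", "b", "c"), "d")
t2 = binarize(t1)
print(t2, triplets_of(t1) <= triplets_of(t2))
```

Every triplet of `t1` is still a triplet of `t2`; the binarization only adds triplets such as $ab|c$ that the unresolved node left open.
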
VI. TripNet algorithm

Now we describe the TripNet algorithm in nine steps. The input of the algorithm is a set of
triplets $\tau$ and the output is a network consistent with $\tau$. Moreover, if $\tau$ is consistent with a tree, the
algorithm constructs a binarization of $T_\tau$.

Step 1: In this step we find a height function $h$ on $L(\tau)$. If $G_\tau$ is a DAG we set $G'_\tau = G_\tau$. If
$G_\tau$ is not a DAG we remove some edges from $G_\tau$ in such a way that the resulting graph $G'_\tau$ is
a DAG. Set $h = h_{G'_\tau}$.

If $\tau$ is obtained from biological sequences using the QOT method, then Theorem 1 shows
that $G_\tau$ is a DAG. Removing the minimum number of edges from a directed graph to make it a
DAG is known as the minimum Feedback Arc Set problem, which is NP-hard (10). Thus, using a
greedy algorithm, we try to remove as few edges as possible from $G_\tau$ in order to
lose as little information as possible. However, any such missing information will be recaptured in Step 9.
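The text does not specify the greedy rule used in Step 1, so the following Python sketch swaps in a different, standard heuristic purely for illustration: it runs a depth-first search and drops every back edge, which always leaves a DAG but does not in general remove a minimum number of edges. The function name `remove_back_edges` is hypothetical.

```python
def remove_back_edges(nodes, edges):
    """Return a list of kept edges that form a DAG.

    Heuristic (not the greedy rule of TripNet): run a depth-first search
    and drop every edge pointing to a node on the current DFS path.
    """
    succ = {v: [] for v in nodes}
    for u, v in edges:
        succ[u].append(v)

    WHITE, GREY, BLACK = 0, 1, 2
    colour = {v: WHITE for v in nodes}
    kept = []

    def visit(u):
        colour[u] = GREY
        for v in succ[u]:
            if colour[v] == GREY:
                continue                 # back edge: it closes a cycle
            kept.append((u, v))
            if colour[v] == WHITE:
                visit(v)
        colour[u] = BLACK

    for v in nodes:
        if colour[v] == WHITE:
            visit(v)
    return kept

print(remove_back_edges(["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")]))
```

The recursion is adequate for small graphs; for large inputs an iterative search would avoid Python's recursion limit.
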

Step 2: In this step TripNet first applies HBUILD to $h$. If the result is a tree, TripNet constructs
a binarization of this tree. Otherwise TripNet goes to Step 3.

Note that if $\tau$ is consistent with a tree then, by Theorem 4, $h_{G_\tau} = h_{T_\tau}$ and TripNet constructs a
binarization of $T_\tau$.

Step 3: Remove all the maximum-weight edges from $G$. The process of removing all the
maximum-weight edges from the graph continues until the resulting graph is disconnected.

In (3) and (4) the authors introduced the concept of SN-sets for a set of triplets $\tau$. A subset $S$ of
$L(\tau)$ is an SN-set if there is no triplet $ij|k \in \tau$ such that $i \notin S$ and $j,k \in S$. In (4) it is shown
that if $\tau$ is dense then the maximal SN-sets partition $L(\tau)$ and can be found in polynomial
time. By contracting each of the SN-sets to a single node and assuming a common ancestor for
all of these leaves, the size of the problem is reduced. In these papers, the authors use the high density
of the input triplet sets to find the maximal SN-sets in polynomial time. The TripNet
algorithm uses the concept of height function as an auxiliary tool to obtain SN-sets instead of
the high density assumption.
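The SN-set condition is easy to test directly. The Python sketch below assumes the reading of the definition in which a triplet $ij|k$ violates the condition when $k$ and exactly one of $i$, $j$ lie in $S$ (the two leaves on the same side of the root are interchangeable); the function name `is_sn_set` is hypothetical. The input is the triplet set of the worked example given later in this section.

```python
def is_sn_set(subset, triplets):
    """True if no triplet ij|k has k and exactly one of i, j inside the subset."""
    inside = set(subset)
    for i, j, k in triplets:             # (i, j, k) encodes ij|k
        if k in inside and (i in inside) != (j in inside):
            return False
    return True

tau = [("j", "k", "i"), ("k", "l", "j"), ("k", "l", "i"), ("n", "o", "m"),
       ("l", "o", "k"), ("j", "l", "o"), ("m", "n", "l"), ("m", "n", "j"),
       ("n", "o", "k"), ("m", "o", "i"), ("j", "k", "n"), ("i", "j", "o"),
       ("i", "k", "m"), ("i", "l", "n")]
print(is_sn_set({"n", "o"}, tau))   # no triplet separates n from o
print(is_sn_set({"i", "j"}, tau))   # jk|i has i and j inside, k outside
```
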

Step 4: For each connected component obtained in Step 3 which is not an SN-set we apply Step
3. This process continues until all of the resulting components are SN-sets. Let $\{S_1, S_2, \ldots, S_k\}$
be the set of resulting SN-sets. If each SN-set contains only one node, HBUILD is applied and, if
the result is a tree, TripNet constructs a binary tree and goes to Step 6. Otherwise TripNet goes to
Step 5. If for some $i$, $|S_i| > 1$, contract each $S_j$ to a single node $s_j$ and set $S = \{s_1, s_2, \ldots, s_k\}$.
Update the set of triplets by defining $\tau_S = \{s_is_j|s_k : \exists\, xy|z \in \tau,\ x \in S_i,\ y \in S_j$ and
$z \in S_k\}$. Construct a weighted complete graph $(G_S, w_S)$ with $V(G_S) = S$ and $w_S(s_i,s_j) =
\min\{h(x,y) : x \in S_i$ and $y \in S_j\}$. Set $(G,w) = (G_S,w_S)$; TripNet then goes to Step 3.
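The contraction at the end of Step 4 can be sketched in Python as follows. The function name `contract`, the use of list indices to stand for the contracted nodes $s_i$, and the decision to keep only triplets whose three leaves fall into three different SN-sets are assumptions of this illustration.

```python
from itertools import combinations

def contract(sn_sets, triplets, h):
    """Contract each SN-set to one node; return (tau_S, w_S).

    sn_sets: list of disjoint sets of leaves; index i stands for node s_i.
    triplets: iterable of (x, y, z) encoding xy|z.
    h: dict mapping frozenset({x, y}) to a height.
    """
    index = {leaf: i for i, block in enumerate(sn_sets) for leaf in block}
    tau_s = set()
    for x, y, z in triplets:
        a, b, c = index[x], index[y], index[z]
        if len({a, b, c}) == 3:          # assumption: three distinct blocks
            tau_s.add((min(a, b), max(a, b), c))
    w_s = {}
    for a, b in combinations(range(len(sn_sets)), 2):
        w_s[frozenset((a, b))] = min(
            h[frozenset((x, y))] for x in sn_sets[a] for y in sn_sets[b]
        )
    return tau_s, w_s

sn_sets = [{"a"}, {"b"}, {"c", "d"}]
tau = [("a", "b", "c"), ("c", "d", "a"), ("a", "b", "d")]
h = {frozenset("ab"): 1, frozenset("ac"): 2, frozenset("ad"): 3,
     frozenset("bc"): 2, frozenset("bd"): 2, frozenset("cd"): 1}
print(contract(sn_sets, tau, h))
```
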

The following theorem is a consequence of the definition of an SN-set for $(G_S, w_S)$.

Theorem 9. Applying Steps 3 and 4 on $(G_S, w_S)$ and $\tau_S$, each resulting SN-set has one member.

Proof: Suppose that $S = \{s_1, s_2, s_3, \ldots, s_r\}$ is an SN-set in $(G_S, w_S)$. Now assume that in
the procedure of Step 3, by removing the edges with weight $l$, $S_1$ separates from $S_2$. Thus there
exists $k > l$ such that by removing the edges with weight at least $k$ in $(G_S, w_S)$, the connected
component $S$ separates from the other components of $G_S$. It means that by removing the edges with
weight at least $k$ in $G$, we obtain the SN-set $S_1 \cup \cdots \cup S_r$ in $\tau$, which is a contradiction. ■

In the next step the reticulation leaves are recognized using the following three criteria:

Criterion I: Let $m_i$ and $M_i$ be the minimum and maximum weight of the edges in $(G,h)$
with exactly one end in $S_i$. Choose the node with minimum $m_i$, and if there is more than one
node with minimum $m_i$ then choose among them the nodes which have minimum $M_i$. Let $R_1$
denote the set of such nodes.

Criterion II: Let $w_{\min} = \min\{w(s_i,s_j) : 1 \leq i,j \leq k\}$. In $G_S$ consider the induced subgraph
on the edges with weight $w_{\min}$. Choose the nodes of $R_1$ with the maximum degree in this
induced subgraph. Let $R_2$ denote the set of such nodes.

Criterion III: For each node $s \in R_2$, remove it from $G_S$ and find the SN-sets for this new graph
using Steps 3 and 4. Let $n_s$ be the number of SN-sets of this new graph with cardinality greater
than one. Choose the nodes in $R_2$ with maximum $n_s$. Let $R_3$ denote the set of such nodes.
We state an example to show the idea behind these three criteria.

Let $\tau = \{jk|i, kl|j, kl|i, no|m, lo|k, jl|o, mn|l, mn|j, no|k, mo|i, jk|n, ij|o, ik|m, il|n\}$. $\tau$
is not consistent with a tree but it is consistent with the network $N$ shown in Fig. 4(a). Obviously,
$N$ is an optimal network consistent with $\tau$. In order to find SN-sets we construct $G'_\tau$ and
$(G,h)$, and find the SN-sets from $(G,h)$ using Steps 3 and 4 (Figs. 4(b) to 4(g)). It follows that
$S = \{\{i\}, \{j\}, \{k\}, \{l\}, \{m\}, \{n,o\}\}$. Now in $G_S$ (Fig. 4(h)) we expect that the reticulation leaf is
in $R_1$. In this example both $k$ and $l$ are in $R_1$. Also we expect that if there is a reticulation leaf,
it belongs to $R_2$, and again both $k$ and $l$ are in $R_2$. Now just $l$ belongs to $R_3$. Thus we consider
$l$ as the reticulation leaf (Figs. 4(i) to 4(n)). Remove the triplets which contain $l$ from $\tau_S$ and denote
the new set of triplets by $\tau'_S$. Obviously, $\tau'_S$ is consistent with a tree. We add this reticulation
leaf to a binarization of $T_{\tau'_S}$ such that the resulting network is consistent with $\tau_S$. Note that if
we consider any node other than $l$ as the reticulation leaf then the final network consistent with
$\tau_S$ has at least two reticulation leaves.

Step 5: In this step the reticulation leaf is recognized using the three criteria. Apply Criterion
I. If $|R_1| = 1$, choose the node $x \in R_1$ as the reticulation leaf. Otherwise, if $|R_1| > 1$, apply
Criterion II. If $|R_2| = 1$, choose the node $x \in R_2$ as the reticulation leaf. Otherwise, if
$|R_2| > 1$, apply Criterion III. If $|R_3| = 1$, choose the node $x \in R_3$ as the reticulation leaf.
Otherwise, if $|R_3| > 1$, the reticulation leaf is chosen according to the speed option as follows.

Slow: Each node in $R_3$ is examined as the reticulation leaf.

Normal: Two nodes in $R_3$ are selected randomly and each of these two nodes is examined as
the reticulation leaf.

Fast: One node in $R_3$ is selected randomly as the reticulation leaf.

Let $x$ be a node which is considered as a reticulation leaf. Remove $x$ from $G_S$ and all of the
triplets which contain $x$ from $\tau_S$. Define $G = G \setminus \{x\}$ and go to Step 3.
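
The decision cascade of Step 5 can be sketched as follows. This is a hedged illustration only: the function `reticulation_candidates` and its parameters are hypothetical, and Criteria II and III are passed in as callables (each mapping a set of nodes to the refined set) so that they are evaluated only when the previous criterion leaves more than one node. The returned list holds the nodes that would each be examined as the reticulation leaf.

```python
import random


def reticulation_candidates(r1, criterion_two, criterion_three,
                            speed="fast", rng=None):
    """Return the nodes to examine as the reticulation leaf in Step 5.

    r1:              set R_1 from Criterion I.
    criterion_two:   callable mapping R_1 to R_2 (assumed helper).
    criterion_three: callable mapping R_2 to R_3 (assumed helper).
    speed:           "slow", "normal" or "fast".
    """
    if speed not in ("slow", "normal", "fast"):
        raise ValueError(f"unknown speed option: {speed!r}")
    rng = rng or random.Random()

    chosen = set(r1)
    for refine in (criterion_two, criterion_three):
        if len(chosen) == 1:
            return sorted(chosen)  # a unique node: no further criterion needed
        chosen = set(refine(chosen))

    if len(chosen) == 1 or speed == "slow":
        return sorted(chosen)      # Slow examines every node of R_3
    how_many = 2 if speed == "normal" else 1
    return sorted(rng.sample(sorted(chosen), min(how_many, len(chosen))))
```

With the Fast option exactly one node is returned, so no branching over alternative reticulation leaves takes place.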

Note that for the Fast option the running time of the algorithm is polynomial. For biological data,
Criteria I and II almost always find a unique reticulation leaf, so on real data the running
time of TripNet is almost always polynomial.

Fig. 4. (a) $N$. (b) $G'_\tau$, obtained from $G_\tau$ by removing the dotted line. (c) $(G, h)$. Edges with weight 6 are shown by dotted lines.



Step 6: Let $x_1, x_2, \ldots, x_m$ be the $m$ reticulation leaves obtained in Step 5, in this
order, and let $T$ be the tree constructed in Step 4. Add these $m$ nodes to $T$ in the reverse
order as follows. Let $e_1$ and $e_2$ be two edges of $T$. Insert two new nodes $y_1$ and
$y_2$ in the middle of $e_1$ and $e_2$. Connect $y_1$ and $y_2$ to a new node $y_3$ and connect the reticulation
leaf $x_m$ to $y_3$. Do this procedure for all pairs of edges and choose a pair such that the resulting
network is consistent with the maximum number of triplets in $\tau$. TripNet continues this procedure
until all the reticulation leaves are added.

Step 7: For each SN-set $S_i$ and the set $\tau_{S_i}$ of triplets, we run the algorithm again.

Step 8: Replace each SN-set in the network of Step 6 with its related network constructed in
Step 7 to obtain a network $N'$.
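
The edge-pair search used in Step 6 to attach one reticulation leaf can be sketched as below. This is a minimal sketch under several assumptions, not the authors' implementation: the network is encoded as a list of distinct `(parent, child)` edges, the function name `attach_reticulation_leaf` and the node labels `("y1", leaf)`, `("y2", leaf)`, `("y3", leaf)` are hypothetical, and the count of consistent triplets is delegated to a caller-supplied `score` function, since testing triplet consistency in a network is outside the scope of this example.

```python
from itertools import combinations


def attach_reticulation_leaf(edges, leaf, score):
    """Try every pair of edges and keep the best-scoring attachment.

    edges: list of distinct (parent, child) pairs of the current network.
    leaf:  label of the reticulation leaf to add.
    score: callable taking a list of edges and returning the number of
           input triplets consistent with it (assumed helper).
    """
    best, best_score = None, None
    for e1, e2 in combinations(edges, 2):
        (u1, v1), (u2, v2) = e1, e2
        y1, y2, y3 = ("y1", leaf), ("y2", leaf), ("y3", leaf)
        # Subdivide e1 and e2 with y1 and y2.
        candidate = [e for e in edges if e not in (e1, e2)]
        candidate += [(u1, y1), (y1, v1), (u2, y2), (y2, v2)]
        # y3 has two parents (a reticulation node) and the leaf as child.
        candidate += [(y1, y3), (y2, y3), (y3, leaf)]
        s = score(candidate)
        if best_score is None or s > best_score:
            best, best_score = candidate, s
    return best, best_score
```

Calling this function once per reticulation leaf, in the reverse of the order in which the leaves were removed, mirrors the loop described in Step 6. Ties between edge pairs are resolved here in favour of the first pair found, which is a choice of this sketch.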

Let $\tau' \subseteq \tau$ be the set of the triplets which are not consistent with $N'$. For each pair of leaves $a$
and $b$, let $\tau'_{ab}$ be the set of triplets in $\tau'$ which are of the form $ab|c$. Consider the pair of
leaves $i$ and $j$ such that $\tau'_{ij}$ has the maximum cardinality. Assume that $p_i$ and $p_j$ are the parents
of $i$ and $j$, respectively.
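
Selecting the pair $i, j$ with the largest $\tau'_{ij}$ is a simple counting task, sketched here in Python. The function name `most_frequent_pair` and the tuple encoding `(a, b, c)` for a triplet $ab|c$ are assumptions of this example. The text does not say how ties are broken, so the sketch simply returns the first pair with the maximum count.

```python
from collections import Counter


def most_frequent_pair(unresolved):
    """Return ((i, j), |tau'_ij|) for the pair with the largest tau'_ij.

    unresolved: iterable of (a, b, c) tuples, each read as a triplet ab|c
                that is not consistent with the current network.
    """
    # ab|c and ba|c describe the same pair, so the pair is unordered.
    counts = Counter(frozenset((a, b)) for a, b, _ in unresolved)
    if not counts:
        return None  # tau' is empty: nothing left to repair
    pair, size = counts.most_common(1)[0]
    return tuple(sorted(pair)), size
```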

Step 9: Create two new nodes in the middle of the edges $p_i i$ and $p_j j$ and connect them
with a new edge. This new edge creates a reticulation node, and all of the triplets in $\tau'_{ij}$ will be
consistent with the new network. All triplets consistent with the new network are removed from
$\tau'$, and this procedure continues until $\tau'$ becomes empty.

Fig. 5 presents an example of the algorithm with all of its steps.